A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 100 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all of their current customers' health care expenses from the last year and compares them with what is known about each customer's plan.
The data on the 100 customers in the sample is as follows:
Answer the following questions using complete sentences and attach all output, plots, etc. within this report.
insurance <- read.csv("../CodingAssignment03/Insurance_Data_Group9.csv")
gt(head(insurance)) # the gt function only makes it look nicer
| Charges | Age | BMI | Female | Children | Smoker | WinterSprings | WinterPark | Oviedo |
|---|---|---|---|---|---|---|---|---|
| 7742.110 | 46 | 26.620 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2026.974 | 21 | 39.490 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2196.473 | 18 | 25.080 | 1 | 0 | 0 | 1 | 0 | 0 |
| 40720.551 | 46 | 30.495 | 0 | 3 | 1 | 0 | 0 | 0 |
| 9877.608 | 51 | 37.730 | 1 | 1 | 0 | 0 | 1 | 0 |
| 11987.168 | 55 | 33.880 | 0 | 3 | 0 | 0 | 1 | 0 |
Randomly select 30 observations from the sample and exclude from all modeling. Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the 70 observations.
set.seed(123457)
exclude <- sample(nrow(insurance), 30)
train <- insurance[-exclude,] # the 70 observations kept for modeling
test <- insurance[exclude,] # the 30 withheld observations
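As a minimal self-contained sketch (using a synthetic stand-in, not the real CSV), `sample()` plus negative indexing yields a disjoint 70/30 partition:

```r
# Synthetic stand-in for the 100-customer sample; demonstrates the split logic.
set.seed(123457)
df <- data.frame(id = 1:100, x = rnorm(100))  # 100 rows, like the insurance data
exclude_demo <- sample(nrow(df), 30)          # indices of the 30 held-out rows
train_demo <- df[-exclude_demo, ]             # 70 rows kept for modeling
test_demo  <- df[exclude_demo, ]              # 30 rows withheld for validation
nrow(train_demo); nrow(test_demo)             # 70 and 30
```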
train %>%
select(Charges, Age, BMI, Children) %>%
tbl_summary(statistic = list(all_continuous() ~ c("{mean}", # Mean
"{sd}", # Standard Deviation
"{median}", # Median
"{min}", # Minimum
"{max}" # Maximum
)
),
type = all_continuous() ~ "continuous2" # Enhanced formatting for continuous variables
)
| Characteristic | N = 70¹ |
|---|---|
| Charges | |
| Mean | 13,375 |
| SD | 12,237 |
| Median | 9,570 |
| Min | 1,136 |
| Max | 51,195 |
| Age | |
| Mean | 41 |
| SD | 14 |
| Median | 44 |
| Min | 18 |
| Max | 64 |
| BMI | |
| Mean | 31.1 |
| SD | 5.7 |
| Median | 30.8 |
| Min | 16.0 |
| Max | 47.7 |
| Children | |
| 0 | 27 (39%) |
| 1 | 15 (21%) |
| 2 | 16 (23%) |
| 3 | 9 (13%) |
| 4 | 2 (2.9%) |
| 5 | 1 (1.4%) |
| ¹ n (%) | |
Provide the correlation between all quantitative variables
cor(train[, c("Charges", "Age", "BMI", "Children")])
## Charges Age BMI Children
## Charges 1.0000000 0.2669529 0.2394231 0.2497586
## Age 0.2669529 1.0000000 0.2372201 0.2987257
## BMI 0.2394231 0.2372201 1.0000000 0.1781634
## Children 0.2497586 0.2987257 0.1781634 1.0000000
Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?
allvar <- lm(Charges ~ Age + BMI + Female + Children + Smoker + WinterPark + WinterSprings + Oviedo, data = train)
summary(allvar)
##
## Call:
## lm(formula = Charges ~ Age + BMI + Female + Children + Smoker +
## WinterPark + WinterSprings + Oviedo, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11292.8 -2542.3 18.9 3010.9 18364.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15510.23 3872.49 -4.005 0.000171 ***
## Age 236.62 47.33 4.999 5.15e-06 ***
## BMI 457.04 117.63 3.886 0.000254 ***
## Female -1722.18 1286.24 -1.339 0.185562
## Children 1401.17 546.30 2.565 0.012799 *
## Smoker 23987.60 1550.21 15.474 < 2e-16 ***
## WinterPark -3948.67 1775.32 -2.224 0.029846 *
## WinterSprings -382.03 1950.68 -0.196 0.845382
## Oviedo -309.50 1913.52 -0.162 0.872042
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5074 on 61 degrees of freedom
## Multiple R-squared: 0.848, Adjusted R-squared: 0.8281
## F-statistic: 42.54 on 8 and 61 DF, p-value: < 2.2e-16
plot(allvar)
Yes, the plots above show that some of the Gauss-Markov assumptions have been violated.
The "Residuals vs Fitted" plot shows a non-linear relationship between the fitted values and the residuals. This violates assumption 3 of the Gauss-Markov theorem, which states that the independent variables are uncorrelated with the error term; the curvature also suggests specification bias.
The second plot, "Normal Q-Q", checks the assumption that the dependent variable is normally distributed for a fixed set of predictors. The points follow the 45-degree reference line fairly closely, but several observations deviate above and below it in the tails. This departure from normality violates the 6th Gauss-Markov assumption.
Finally, the third plot, "Scale-Location", checks for homoskedasticity, and here the assumption (assumption 4) is violated. If it held, you would see points scattered randomly around a horizontal line; instead the line trends downward and the residuals "fan out" as the fitted values increase.
The last plot “Residuals vs. Leverage” keeps an eye out for regression outliers, influential observations, and high leverage points. There currently aren’t any outliers which fall outside Cook’s distance.
The following are solutions for the Gauss-Markov violations:
For non-linearity, you can transform the dependent or independent variables using log or square-root transformations. Given the wide spread of insurance charges, taking the log of Charges makes the relationship more nearly linear and also stabilizes the variance, addressing the heteroskedasticity.
You could also modify the model to include dummy variables that allow different intercepts for each group, which may represent the data more accurately and further alleviate the heteroskedasticity. Most insurers do not price on the exact number of children in a family; there are typically two plan types, "single" and "family". Transforming Children from a count into a dummy variable therefore categorizes the variable more appropriately.
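The proposed Children transformation can be sketched in one line: a count becomes a single/family indicator (example counts only, not the actual data):

```r
# Collapse a children count into a 0/1 plan-type dummy.
children_demo <- c(0, 1, 0, 2, 3)             # hypothetical example counts
family_demo <- as.integer(children_demo > 0)  # 1 = has dependents ("family" plan)
family_demo  # 0 1 0 1 1
```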
Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?
train$lnCharges <- log(train$Charges)
hist(train$lnCharges) #after for log Charges
summary(train$lnCharges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.036 8.625 9.166 9.129 9.671 10.843
hist(train$Charges) #before
train$ChargesSquared <- train$Charges^2 # charges squared
hist(train$ChargesSquared) #after
summary(train$ChargesSquared)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.291e+06 3.165e+07 9.162e+07 3.265e+08 2.517e+08 2.621e+09
scatterplotMatrix(train[c(10,2:3,5)]) # lnCharges
par(mfrow=c(1,2)) # Charts side by side
As the histograms above show, taking the log of Charges pulls the outliers in toward the rest of the distribution. The untransformed Charges are skewed right, with a long tail of large claims, and squaring Charges makes that skew drastically worse, illustrating why squaring the dependent variable is inappropriate here. After the log transformation, the histogram shows that lnCharges is much closer to a normal distribution.
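This effect can be quantified with a moment skewness statistic. A self-contained illustration on simulated right-skewed data (not the actual charges):

```r
# Moment (third standardized moment) skewness; positive = long right tail.
moment_skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(1)
charges_sim <- rlnorm(1000, meanlog = 9, sdlog = 1)  # right-skewed, like Charges
moment_skew(charges_sim)       # strongly positive before the transform
moment_skew(log(charges_sim))  # near zero: roughly symmetric after the log
```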
Below, we will take both the log and quadratic forms of multiple independent variables to determine better fits relative to the original linear form, which was found to be problematic.
train$ChildrenSquared <- train$Children^2 # Children squared
hist(train$ChildrenSquared) #after
summary(train$ChildrenSquared)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 1.0 3.1 4.0 25.0
train$lnChildren <- log(train$Children) #log of children
hist(train$lnChildren) #after
summary(train$lnChildren)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf -Inf 0.0000 -Inf 0.6931 1.6094
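Note the `-Inf` values above: log(0) is -Inf, so lnChildren is unusable for customers with no children. One common alternative (not what this report ultimately uses) is `log1p()`, i.e. log(1 + x), which keeps zero counts finite:

```r
# log() breaks on zero counts; log1p() does not.
children_demo <- c(0, 1, 0, 2, 3)  # example counts
log(children_demo)    # contains -Inf for the zeros
log1p(children_demo)  # 0.000 0.693 0.000 1.099 1.386 -- all finite
```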
scatterplotMatrix(train[c(10,2:3,12)]) # lnCharges with lnChildren
par(mfrow=c(1,2)) # Charts side by side
hist(train$Age) #before
train$lnAge <- log(train$Age) # log of age
hist(train$lnAge) #after
summary(train$lnAge)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.890 3.305 3.773 3.636 3.912 4.159
hist(train$Age) #before
train$AgeSquared <- train$Age^2 # age squared
hist(train$AgeSquared) #after
summary(train$AgeSquared)
hist(train$BMI) #before
train$lnBMI <- log(train$BMI) #log of BMI
hist(train$lnBMI) #after
summary(train$lnBMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.770 3.288 3.427 3.419 3.531 3.866
train$BMISquared <- train$BMI^2 # quadratic of BMI
hist(train$BMISquared) #after
summary(train$BMISquared)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 254.7 717.5 947.6 996.7 1167.9 2279.1
Based on the transformation above, we will create models that reflect both quadratic and linear forms for multiple independent variables.
#Model 1
model_1 <- lm(lnCharges ~., data = train[,c(10,2:9)] ) #pulling only columns I want
summary(model_1)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:9)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7258 -0.1434 0.0042 0.1503 1.4266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.572798 0.255273 25.748 <2e-16 ***
## Age 0.035493 0.003120 11.376 <2e-16 ***
## BMI 0.020083 0.007754 2.590 0.0120 *
## Female 0.121214 0.084788 1.430 0.1579
## Children 0.088441 0.036012 2.456 0.0169 *
## Smoker 1.656425 0.102189 16.209 <2e-16 ***
## WinterSprings 0.007457 0.128588 0.058 0.9539
## WinterPark -0.126200 0.117029 -1.078 0.2851
## Oviedo -0.028646 0.126139 -0.227 0.8211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3345 on 61 degrees of freedom
## Multiple R-squared: 0.8794, Adjusted R-squared: 0.8636
## F-statistic: 55.6 on 8 and 61 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_1)
#ln charges
#Model 2
model_2 <- lm(lnCharges ~., data = train[,c(10,3:9,15)] ) #pulling only columns I want
summary(model_2)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 3:9, 15)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87652 -0.16920 0.02645 0.16491 1.41726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.189e+00 2.696e-01 26.663 < 2e-16 ***
## BMI 2.127e-02 8.500e-03 2.503 0.01501 *
## Female 1.304e-01 9.298e-02 1.402 0.16598
## Children 1.065e-01 3.919e-02 2.718 0.00854 **
## Smoker 1.663e+00 1.122e-01 14.823 < 2e-16 ***
## WinterSprings 6.428e-03 1.410e-01 0.046 0.96380
## WinterPark -1.267e-01 1.284e-01 -0.987 0.32778
## Oviedo -2.027e-02 1.386e-01 -0.146 0.88417
## AgeSquared 4.115e-04 4.172e-05 9.862 3.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3669 on 61 degrees of freedom
## Multiple R-squared: 0.8549, Adjusted R-squared: 0.8359
## F-statistic: 44.92 on 8 and 61 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_2)
#lnCharges and AgeSquared
#Model 3
model_3 <- lm(lnCharges ~., data = train[,c(10,14,3:9)] ) #pulling only columns I want
summary(model_3)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 14, 3:9)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55977 -0.14355 0.02305 0.10310 1.37177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.188562 0.404219 7.888 6.85e-11 ***
## lnAge 1.339429 0.106512 12.575 < 2e-16 ***
## BMI 0.019762 0.007223 2.736 0.00814 **
## Female 0.109889 0.079076 1.390 0.16968
## Children 0.072828 0.033830 2.153 0.03530 *
## Smoker 1.631300 0.095059 17.161 < 2e-16 ***
## WinterSprings 0.002113 0.119861 0.018 0.98599
## WinterPark -0.130848 0.109018 -1.200 0.23468
## Oviedo -0.040374 0.117440 -0.344 0.73219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3118 on 61 degrees of freedom
## Multiple R-squared: 0.8952, Adjusted R-squared: 0.8815
## F-statistic: 65.13 on 8 and 61 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_3)
#lnCharges and lnAge
#Model 4
model_4 <- lm(lnCharges ~., data = train[,c(10,2,4:9,17)] ) #pulling only columns I want
summary(model_4)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 17)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70989 -0.16308 0.00825 0.14457 1.43884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.886e+00 1.816e-01 37.926 <2e-16 ***
## Age 3.573e-02 3.106e-03 11.503 <2e-16 ***
## Female 1.227e-01 8.492e-02 1.445 0.1536
## Children 8.775e-02 3.610e-02 2.431 0.0180 *
## Smoker 1.656e+00 1.023e-01 16.188 <2e-16 ***
## WinterSprings 6.694e-05 1.289e-01 0.001 0.9996
## WinterPark -1.288e-01 1.176e-01 -1.096 0.2776
## Oviedo -2.688e-02 1.262e-01 -0.213 0.8321
## BMISquared 3.049e-04 1.191e-04 2.561 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3349 on 61 degrees of freedom
## Multiple R-squared: 0.8791, Adjusted R-squared: 0.8633
## F-statistic: 55.46 on 8 and 61 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_4)
#lnCharges and BMISquared
#Model 5
model_5 <- lm(lnCharges ~., data = train[,c(10,2,4:9,16)] ) #pulling only columns I want
summary(model_5)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 16)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73920 -0.13134 0.00264 0.13730 1.41835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.101250 0.784230 6.505 1.63e-08 ***
## Age 0.035289 0.003141 11.236 < 2e-16 ***
## Female 0.118779 0.084788 1.401 0.1663
## Children 0.089731 0.035970 2.495 0.0153 *
## Smoker 1.655747 0.102238 16.195 < 2e-16 ***
## WinterSprings 0.017437 0.128540 0.136 0.8925
## WinterPark -0.120176 0.116477 -1.032 0.3063
## Oviedo -0.029406 0.126237 -0.233 0.8166
## lnBMI 0.613866 0.238249 2.577 0.0124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3346 on 61 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8634
## F-statistic: 55.53 on 8 and 61 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_5)
#lnCharges and lnBMI
#Model 6
model_6 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,12,5)] ) #pulling only columns I want
summary(model_6)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 12,
## 5)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70026 -0.14405 -0.00955 0.13481 1.37468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.620186 0.259924 25.470 < 2e-16 ***
## Age 0.035077 0.003150 11.136 3.11e-16 ***
## BMI 0.018474 0.007929 2.330 0.0232 *
## Female 0.132595 0.085614 1.549 0.1267
## Smoker 1.650975 0.102378 16.126 < 2e-16 ***
## WinterSprings -0.007636 0.129557 -0.059 0.9532
## WinterPark -0.137528 0.117643 -1.169 0.2470
## Oviedo -0.055094 0.129051 -0.427 0.6710
## ChildrenSquared -0.024082 0.024630 -0.978 0.3321
## Children 0.171712 0.092472 1.857 0.0682 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3346 on 60 degrees of freedom
## Multiple R-squared: 0.8813, Adjusted R-squared: 0.8635
## F-statistic: 49.49 on 9 and 60 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_6)
#lnCharges and ChildrenSquared
#Model 7
#make children a dummy variable
train$childdummy <- 0
train$childdummy[train$Children > 0] <- 1
model_7 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,18)] ) #pulling only columns I want
summary(model_7)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 18)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65227 -0.13093 -0.01566 0.13835 1.37804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.642231 0.251498 26.411 < 2e-16 ***
## Age 0.035427 0.003043 11.644 < 2e-16 ***
## BMI 0.016919 0.007792 2.171 0.03380 *
## Female 0.133797 0.083224 1.608 0.11307
## Smoker 1.663931 0.100444 16.566 < 2e-16 ***
## WinterSprings -0.023576 0.121822 -0.194 0.84719
## WinterPark -0.161530 0.112590 -1.435 0.15649
## Oviedo -0.068542 0.123236 -0.556 0.58012
## childdummy 0.263374 0.090093 2.923 0.00485 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3284 on 61 degrees of freedom
## Multiple R-squared: 0.8838, Adjusted R-squared: 0.8685
## F-statistic: 57.97 on 8 and 61 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_7)
#dummy variable children
#Model 8
model_8 <- lm(lnCharges ~., data = train[,c(10,2:9,15)] ) #pulling only columns I want
summary(model_8)
##
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:9, 15)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58721 -0.15275 0.02552 0.10953 1.38677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8158862 0.3887536 14.960 < 2e-16 ***
## Age 0.0799787 0.0179957 4.444 3.87e-05 ***
## BMI 0.0201024 0.0074383 2.703 0.00894 **
## Female 0.1124223 0.0814137 1.381 0.17244
## Children 0.0727190 0.0351110 2.071 0.04266 *
## Smoker 1.6373813 0.0983248 16.653 < 2e-16 ***
## WinterSprings 0.0070994 0.1233556 0.058 0.95430
## WinterPark -0.1373286 0.1123543 -1.222 0.22638
## Oviedo -0.0534739 0.1214107 -0.440 0.66120
## AgeSquared -0.0005500 0.0002194 -2.507 0.01490 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3209 on 60 degrees of freedom
## Multiple R-squared: 0.8908, Adjusted R-squared: 0.8745
## F-statistic: 54.4 on 9 and 60 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_8)
#lnCharges and Age + AgeSquared
Use the 30 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)
bad_model <- lm(Charges ~., data = insurance)
summary(bad_model)
##
## Call:
## lm(formula = Charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12267.4 -2639.4 -112.3 2385.5 19603.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13823.02 3448.79 -4.008 0.000125 ***
## Age 251.26 45.76 5.491 3.58e-07 ***
## BMI 394.14 100.36 3.927 0.000167 ***
## Female -718.53 1189.05 -0.604 0.547154
## Children 1075.41 534.90 2.010 0.047340 *
## Smoker 24707.50 1343.92 18.385 < 2e-16 ***
## WinterSprings -1338.49 1727.60 -0.775 0.440484
## WinterPark -3937.16 1578.52 -2.494 0.014429 *
## Oviedo -796.89 1769.67 -0.450 0.653561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5602 on 91 degrees of freedom
## Multiple R-squared: 0.8194, Adjusted R-squared: 0.8036
## F-statistic: 51.62 on 8 and 91 DF, p-value: < 2.2e-16
test$lnCharges <- log(test$Charges)
test$lnAge <- log(test$Age)
test$lnBMI <- log(test$BMI)
test$bad_model_pred <- predict(bad_model, newdata = test)
test$model_3_pred <- predict(model_3, newdata = test) %>% exp()
#lnAge
test$model_5_pred <- predict(model_5, newdata = test) %>% exp()
#lnBMI
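A caveat on the `exp()` back-transform above: exponentiating log-scale predictions systematically understates the conditional mean of a log-linear model. Duan's smearing estimator, mean(exp(residuals)), is one standard correction; a self-contained sketch on simulated data (not the report's fitted model):

```r
# Simulated log-linear data to show the smearing correction.
set.seed(42)
x <- runif(200, 18, 64)                        # an age-like predictor
y <- exp(7 + 0.03 * x + rnorm(200, sd = 0.4))  # truth is log-linear
fit <- lm(log(y) ~ x)
smear <- mean(exp(residuals(fit)))             # smearing factor, always > 1
naive <- exp(fitted(fit))                      # plain back-transform, biased low
corrected <- naive * smear                     # smeared predictions
smear
```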
# Finding the error
test$error_bm <- test$bad_model_pred - test$Charges
test$error_3 <- test$model_3_pred - test$Charges
test$error_5 <- test$model_5_pred - test$Charges
# Bad Model
mean(test$error_bm)
## [1] -381.4727
# Model 3
mean(test$error_3)
## [1] -1048.058
# Model 5
mean(test$error_5)
## [1] -1729.154
# Function to calculate MAE
mae <- function(error_vector){
  error_vector %>%
    abs() %>%
    mean()
}
# Bad Model
mae(test$error_bm)
## [1] 4582.452
# Model 3
mae(test$error_3)
## [1] 3813.108
# Model 5
mae(test$error_5)
## [1] 3933.514
rmse <- function(error_vector){
  error_vector^2 %>%
    mean() %>%
    sqrt()
}
# Bad Model
rmse(test$error_bm)
## [1] 6428.387
# Model 3
rmse(test$error_3)
## [1] 6747.818
# Model 5
rmse(test$error_5)
## [1] 6804.108
mape <- function(error_vector, actual_vector){
  (error_vector/actual_vector) %>%
    abs() %>%
    mean()
}
# Bad Model
mape(test$error_bm, test$Charges) # the original call used test$saleprice, a leftover column name, which returned NaN
# Model 3
mape(test$error_3, test$Charges)
# Model 5
mape(test$error_5, test$Charges)
All things considered, Model 3 (lnCharges regressed on lnAge and the other predictors) is the better of the two candidates: on the withheld observations it has the smaller bias, the lower MAE, and the lower RMSE. It would therefore be the best model to use long-term.
Provide interpretations of the coefficients, do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.
-The regression output from question 5 shows that Age, BMI, Children, and Smoker all have a positive relationship with charges, while Female, WinterSprings, WinterPark, and Oviedo (each measured relative to the baseline city, Sanford) have a negative relationship. The signs make sense given the increased health risks associated with the variables that raise health expenses.
-For Age, each additional year adds an expected $251.26 in medical expenses. For every one-point increase in BMI, expenses rise by an expected $394.14. A female customer is expected to incur $718.53 less than a male. Each additional child adds an expected $1,075.41. A smoker is expected to incur $24,707.50 more than a non-smoker. Relative to the default location of Sanford, a customer in Winter Springs is expected to incur $1,338.49 less, a customer in Winter Park $3,937.16 less, and a customer in Oviedo $796.89 less.
An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.
| Customer | Age | BMI | Female | Children | Smoker | City |
|---|---|---|---|---|---|---|
| 1 | 60 | 22 | 1 | 0 | 0 | Oviedo |
| 2 | 40 | 30 | 0 | 1 | 0 | Sanford |
| 3 | 25 | 25 | 0 | 0 | 1 | Winter Park |
| 4 | 33 | 35 | 1 | 2 | 0 | Winter Springs |
| 5 | 45 | 27 | 1 | 3 | 0 | Oviedo |
#
name <- c("Customer", "Age", "BMI","Female","Children","Smoker","City")
Customer <- c("1","2","3","4","5")
Age <- c(60,40,25,33,45)
BMI <- c(22,30,25,35,27)
Female <- c(1,0,0,1,1)
childdummy <- c(0,1,0,1,1) # 1 if any children; the counts 0,1,0,2,3 must be collapsed to 0/1 to match model_7's dummy
Smoker <- c(0,0,1,0,0)
WinterSprings <- c(0,0,0,1,0)
WinterPark <- c(0,0,1,0,0)
Oviedo <- c(1,0,0,0,1)
Sanford <- c(0,0,0,0,0)
#
Potential_Clients <-data.frame(Customer,Age,BMI,Female,childdummy,Smoker,WinterSprings,WinterPark,Oviedo,Sanford)
#
predict(model_7, newdata = Potential_Clients,interval = "prediction")
## fit lwr upr
## 1 9.205349 8.482255 9.928443
## 2 8.830276 8.143659 9.516892
## 3 9.453296 8.754262 10.152331
## 4 9.040476 8.297437 9.783515
## 5 9.548658 8.744717 10.352600
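The fit/lwr/upr values above are on the log-charge scale; exponentiating the endpoints converts them back to dollars. Shown here for the first two customers, using the printed values:

```r
# Back-transform log-scale prediction intervals to dollar amounts.
pred_log <- matrix(c(9.205349, 8.482255, 9.928443,
                     8.830276, 8.143659, 9.516892),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(1:2, c("fit", "lwr", "upr")))
round(exp(pred_log))  # customer 1: roughly $4,830 to $20,500, centered near $9,950
```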
The owner notices that some of the predictions are wider than others, explain why.
-A prediction interval is wider than the corresponding confidence interval, as expected: it accounts both for the uncertainty in estimating the conditional mean and for the random variation of individual observations around that mean.
-Some prediction intervals are wider than others because those clients' predictor values sit farther from the sample averages, closer to the edges of the observed ranges. For instance, customer 5 (age 45, three children, Oviedo) has the widest interval on the log scale, 10.35 − 8.74 = 1.61, while customer 2, whose values lie close to the sample means and who is in the baseline city of Sanford, has the narrowest, 9.52 − 8.14 = 1.37. This illustrates how the input data influence the width of the prediction intervals.
Are there any prediction problems that occur with the five potential clients? If so, explain.
-Prediction problems often involve several types of errors, including extrapolation, model bias, high variance, and the effects of non-linearity and interactions. In this context, extrapolation error is unlikely to be a concern since the range of values for each independent variable in Model 7’s training data set adequately covers the range needed for the five clients in terms of age, BMI, number of children, and other factors. However, the model exhibits a negative bias, suggesting that predictions are likely to underestimate actual values.
-Another critical consideration is the sample size. Model 7 was trained on only 70 observations, which is relatively small; a larger sample of several hundred would represent the population more accurately and improve the model's reliability. Additionally, because the dependent variable is log-transformed (lnCharges), the predictions must be back-transformed (exponentiated) to dollar amounts for accurate interpretation.